Lab book: CoV data summary

Liam Brierley (University of Liverpool)
2020-04-23

Numbers of viruses, sequences

Search terms: (spike[Title] OR “S gene”[Title] OR “S protein”[Title] OR “S glycoprotein”[Title] OR “S1 gene”[Title] OR “S1 protein”[Title] OR “S1 glycoprotein”[Title] OR peplomer[Title] OR peplomeric[Title] OR peplomers[All Title] OR “complete genome”[Title]) NOT (patent[Title] OR vaccine OR artificial OR construct OR recombinant[Title])

                          host recognised   no hosts
no spike sequence                      64        947
spike sequence available               54        520
Number of sequences per coronavirus is heavily skewed; most have just 1:
n_seqs Freq
1 454
2 52
3 7
4 5
5 2
6 5
7 1
8 2
10 2
11 1
12 3
13 2
14 1
16 2
17 2
19 2
23 1
26 2
27 1
30 1
31 2
32 1
33 1
35 1
43 1
51 1
54 1
60 1
66 1
71 1
75 1
86 1
150 1
172 1
183 1
361 1
393 1
660 1
679 1
739 1
753 1
844 1
991 1
3514 1
5953 1
childtaxa_name n_seqs
1581 Feline coronavirus 753
1582 Severe acute respiratory syndrome coronavirus 2 844
1583 Middle East respiratory syndrome-related coronavirus 991
1584 Porcine epidemic diarrhea virus 3514
1585 Infectious bronchitis virus 5953
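
The skew shown above is a "frequency of frequencies" tabulation: count the sequences per virus, then count how many viruses share each count. A minimal Python sketch of that two-step tally follows. The labels in `seq_virus_labels` are made-up toy data rather than the real records, and all variable names are hypothetical.

```python
from collections import Counter

# Toy data (assumption): one virus label per retrieved sequence record.
seq_virus_labels = [
    "virus_A", "virus_A", "virus_A",
    "virus_B",
    "virus_C",
    "virus_D", "virus_D",
]

# Step 1: number of sequences per virus.
seqs_per_virus = Counter(seq_virus_labels)

# Step 2: number of viruses that have each sequence count.
freq_of_counts = Counter(seqs_per_virus.values())

# Print in the same two-column layout as the table above.
for n_seqs, n_viruses in sorted(freq_of_counts.items()):
    print(n_seqs, n_viruses)
```

Sorting the per-virus counts in descending order instead would give the kind of "top five" listing shown above.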

SARS-CoV-2 sequences

Numbers of hosts

Considering only the 574 coronaviruses with available spike protein sequence data…

Number of host species per coronavirus also heavily skewed, as expected:
n_hosts Freq
0 520
1 38
2 3
3 4
5 2
6 1
8 1
15 1
18 1
26 1
38 1
48 1
childtaxa_name Hostspp
570 Severe acute respiratory syndrome-related coronavirus 15
571 Alphacoronavirus 1 18
572 Betacoronavirus 1 26
573 Bat coronavirus 38
574 Avian coronavirus 48

Coronaviruses with the broadest host range include very broad species that encompass many individual strains.

Number of coronaviruses per host species also heavily skewed:
n_viruses Freq
0 1263
1 124
2 20
3 4
4 2
5 2
7 1
23 1
host_species n_viruses
1412 mustela putorius 4
1413 rattus norvegicus 4
1414 sus scrofa 5
1415 vicugna pacos 5
1416 homo sapiens 7
1417 rhinolophus sinicus 23

As expected, the best-represented hosts are commonly studied species (ferret, rat, domestic pig, human), plus livestock (alpaca) and one horseshoe bat (Rhinolophus sinicus), whose sequences mostly derive from a single study.

Number of coronaviruses infecting each host group:

Host groups are mutually exclusive, i.e. primates = non-human primates. Other mammals = miscellaneous orders (Proboscidea, Eulipotyphla, Cingulata, etc.).

Not too sure this is very meaningful given how little we know about potential animal hosts of coronaviruses

Sequence data quality

Var1            Freq
complete_spike  0.1479955
partial_spike   0.6415494
whole_genome    0.2104551

       complete_spike  partial_spike  whole_genome
other            1384            440         30574
S                2415           5462          3557
S1                 15           5244             0
S2                 11             47             0

Excluding partial sequences, summaries of counts of different complete spike protein sequence types per coronavirus (taxid):

n_seqs    0   1  2 3 4 5 6 7 8 10 12 13 14 16 17 19 21 26 27 28 31 33 42 43 44 46 47 53 87 174 213 282 510 709 831 1888
n_viruses 8 312 49 7 5 5 4 1 2  2  1  3  2  1  2  1  1  2  1  1  1  2  1  1  1  1  1  1  2   1   1   1   1   1   1    1

So 5972 individual spike protein sequences across 420 viruses.
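
As a quick arithmetic check, the two totals can be recovered from the count table directly: the number of viruses is the sum of frequencies for non-zero counts, and the number of sequences is the sum of count times frequency. The sketch below copies the values from the table; the variable names are hypothetical.

```python
# Values copied from the table above: complete spike sequences per virus
# (n_seqs) and the number of viruses with that many sequences (n_viruses).
n_seqs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 13, 14, 16, 17, 19, 21, 26,
          27, 28, 31, 33, 42, 43, 44, 46, 47, 53, 87, 174, 213, 282, 510,
          709, 831, 1888]
n_viruses = [8, 312, 49, 7, 5, 5, 4, 1, 2, 2, 1, 3, 2, 1, 2, 1, 1, 2,
             1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
             1, 1]
assert len(n_seqs) == len(n_viruses)

# Viruses with at least one complete spike sequence.
viruses_with_spike = sum(v for s, v in zip(n_seqs, n_viruses) if s > 0)

# Total complete spike sequences = sum of (count x frequency).
total_sequences = sum(s * v for s, v in zip(n_seqs, n_viruses))

print(viruses_with_spike, total_sequences)  # 420 5972
```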

NB further complete coding sequences are available that give subunits separately: for S1, 1 virus (Infectious bronchitis virus) and for S2, 2 viruses (Infectious bronchitis virus, Porcine epidemic diarrhea virus). Not considering these for now.

Sequence data summaries

Excluding partial sequences, summaries of genomic characteristics per coronavirus (taxid) (i.e. values are averaged within each virus so that each virus represents only one data point):

Only for viruses that have whole genome sequences, mean lengths:

Lengths, ENC (effective number of codons) and GC content between coronaviruses in spike and other proteins; within coronaviruses in spike:


Analysis of Variance Table

Response: length
            Df    Sum Sq Mean Sq F value    Pr(>F)    
taxid      419 372462761  888933  154.54 < 2.2e-16 ***
Residuals 5552  31934775    5752                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fairly consistent sequence lengths of spikes compared to other proteins (expected as pooling all others). Very little within-coronavirus variation.
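
To make the "very little within-coronavirus variation" reading concrete, the F value and the share of variance lying between viruses (eta-squared) can be recomputed from the sums of squares and degrees of freedom printed in the table. This is a sketch using the standard one-way ANOVA identities, not necessarily how the original analysis was run; the function name is hypothetical.

```python
def one_way_anova_summary(ss_between, df_between, ss_within, df_within):
    """Return (F, eta_squared) from one-way ANOVA sums of squares."""
    ms_between = ss_between / df_between  # mean square for the grouping factor
    ms_within = ss_within / df_within     # residual mean square
    f_value = ms_between / ms_within
    # Proportion of total variance attributable to the grouping factor.
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, eta_squared

# Numbers copied from the spike length table above (taxid vs residuals).
f_length, eta_length = one_way_anova_summary(372462761, 419, 31934775, 5552)
print(round(f_length, 2), round(eta_length, 3))
```

On these figures roughly 92% of the variance in spike length sits between viruses. The degrees of freedom (419 + 5552 + 1) also match the 5972 sequences across 420 viruses.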


Analysis of Variance Table

Response: enc
            Df Sum Sq Mean Sq F value    Pr(>F)    
taxid      419  50902 121.484  361.18 < 2.2e-16 ***
Residuals 5552   1867   0.336                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stronger average codon bias in spike than other proteins! Reasonable variation in spike codon biases between-coronaviruses and within some coronaviruses. Human CoV HKU1 more strongly biased than other CoVs.


Analysis of Variance Table

Response: G + C
            Df    Sum Sq Mean Sq F value    Pr(>F)    
taxid      419 149675245  357220   311.7 < 2.2e-16 ***
Residuals 5552   6362848    1146                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GC content slightly lower and slightly more uniform in spikes than in other proteins! Some variation between-coronaviruses, and some variation within-coronaviruses. Human CoV HKU1 and Wencheng shrew CoV more strongly biased than other CoVs.
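
For reference, a minimal sketch of a per-sequence GC content calculation. The lab book does not state the exact definition used (the ANOVA response is labelled "G + C", which may be a count rather than a proportion), so the choice here to report a fraction and to skip ambiguous bases is an assumption, and `gc_content` is a hypothetical helper.

```python
def gc_content(seq):
    """Return the fraction of G or C among unambiguous A/C/G/T bases."""
    seq = seq.upper().replace("U", "T")  # accept RNA-style input
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)

# Toy sequence (assumption), not real spike data: 4 of 6 bases are G or C.
print(gc_content("ATGCGC"))
```

Averaging this value over the sequences of each taxid before comparing viruses would keep one data point per virus, as done elsewhere in these notes.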

Mean GC content of spike versus known host range count, labelled as human/nonhuman virus

Not too informative though useful to see which virus is which?

Spike protein composition

Dinucleotide biases do vary in scale - clearly some biases present (TG overrepresented, CG underrepresented). But these are pretty consistent between genera.

Reassuring - biases are more extreme at bridge (3-1) dinucleotides as expected. TG, TA, CA overrepresented, GT, GA, AT underrepresented. Still sufficient variability to look for signal in.
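
A sketch of how such biases can be quantified as an observed/expected ratio, for all adjacent bases and for bridge (codon position 3 to position 1 of the next codon) pairs only. The normalisation used here (frequency of the pair divided by the product of the first-base and second-base frequencies among the pairs considered) is an assumption, since other normalisations exist and the one used for these plots is not stated. All function names are hypothetical.

```python
from collections import Counter

def dinucleotide_odds_ratios(pairs):
    """Observed/expected ratio for each dinucleotide in a list of (x, y) pairs.

    Values above 1 indicate over-representation, below 1 under-representation.
    """
    pairs = [(x, y) for x, y in pairs if x in "ACGT" and y in "ACGT"]
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(x for x, _ in pairs)
    second_counts = Counter(y for _, y in pairs)
    return {
        x + y: (count / n) / ((first_counts[x] / n) * (second_counts[y] / n))
        for (x, y), count in pair_counts.items()
    }

def all_pairs(seq):
    """Every adjacent base pair in the sequence."""
    return list(zip(seq, seq[1:]))

def bridge_pairs(seq):
    """Pairs spanning codon boundaries; assumes seq is in frame from index 0."""
    return [(seq[i + 2], seq[i + 3]) for i in range(0, len(seq) - 3, 3)]

toy_seq = "ACGACG"  # toy in-frame sequence (assumption), two codons
print(dinucleotide_odds_ratios(all_pairs(toy_seq)))
print(dinucleotide_odds_ratios(bridge_pairs(toy_seq)))
```

Dinucleotides absent from a sequence simply do not appear in the returned dictionary, so a real analysis would need to decide how to treat zero counts.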

Most obvious thing is use of different stop codons. But otherwise, fairly consistent across genera again.
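
One way to tabulate stop codon use, assuming the comparison is of which stop codon terminates each spike coding sequence. The sequences below are toy data, not real spikes, and `stop_codon_usage` is a hypothetical helper using the standard genetic code.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def stop_codon_usage(coding_seqs):
    """Count the terminal codon of each in-frame coding sequence."""
    usage = Counter()
    for seq in coding_seqs:
        last_codon = seq.upper().replace("U", "T")[-3:]
        # Sequences that do not end in a stop codon are grouped together.
        key = last_codon if last_codon in STOP_CODONS else "none/other"
        usage[key] += 1
    return usage

toy_cds = ["ATGAAATAA", "ATGCCCTGA", "ATGGGGTAA", "ATGTTT"]
print(stop_codon_usage(toy_cds))
```

Grouping the input sequences by genus before calling the function would give the per-genus comparison described above.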

Not convinced amino acid bias is really useful here - it’s just the proportion of each amino acid in the protein sequence, and it’ll be fairly consistent between CoVs.